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I Abstract 

^^ We present a path algorithm for the generahzed lasso problem. This problem penalizes 

^N the £i norm of a matrix D times the coefficient vector, and has a wide range of applications, 

{^JQ dictated by the choice of D. Our algorithm is based on solving the dual of the generalized lasso, 

^ which facilitates computation and conceptual understanding of the path. For D — I (the usual 

lasso), we draw a connection between our approach and the well-known LARS algorithm. For 

an arbitrary D, we derive an unbiased estimate of the degrees of freedom of the generalized 

lasso fit. This estimate turns out to be quite intuitive in many applications. 
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(y-j 1 Introduction 

'^i Regularization with the £i norm seems to be ubiquitous throughout many fields of mathematics and 

C^ engineering. In statistics, the best-known example is the lasso, the application of an £i penalty to 

H Hnear regression [3T1[7]. Let y G R" be a response vector and X £ II"^p be a matrix of predictors. If 

'— ' the response and the predictors have been centered, we can omit an intercept term from the model, 

and then the lasso problem is commonly written as 



> 



in 



X 



1, 



minimize ;!^||y-X/3||2 + A||/3||i, (1) 



where A > is a tuning parameter. There are many fast algorithms for solving the lasso (nj) at a 
single value of the parameter A, or over a discrete set of parameter values. The least angle regression 
^^ (LARS) algorithm, on the other hand, is unique in that it solves ([IJ^ for all A G [0, oo] [11] (see also 

f^ the earlier homotopy method of [23], and the even earlier work of [3]). This is possible because the 

lasso solution is piecewise linear with respect to A. 

The LARS path algorithm may provide a computational advantage when the solution is desired at 
many values of the tuning parameter. For large problems, this is less likely to be the case because the 
number of knots (changes in slope) in the solution path tends to be very large, and this renders the 
^ path intractable. Computational efficiency aside, the LARS method fully characterizes the tradeoff 

between goodness-of-fit and sparsity in the lasso solution (this is controlled by A), and hence yields 
interesting statistical insights into the problem. Most notably, the LARS paper established a result 
on the degrees of freedom of the lasso fit, which was further developed by [33] . 

The first of its kind, LARS inspired the development of path algorithms for various other opti- 
mization problems that appear in statistics [TBI ESI [EJ [20] , and our case is no exception. In this 
paper, we derive a path algorithm for problems that use the £i norm to enforce certain structural 
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constraints — instead of pure sparsity — on the coefficients in a linear regression. These problems are 
nicely encapsulated by the formulation: 



... i| 



minimize -||y-X/3||2 + A||D/3||i, (2) 



where D e ]^™xp jg ^ specified penalty matrix. We refer to problem ([2]) as the generalized lasso. 
Depending on the application, we choose D so that sparsity of Dj3 corresponds to some other desired 
behavior for /?, typically one that is structural or geometric in nature. In fact, various choices of D 
in (pi) give problems that are already well-known in the literature: the fused lasso, trend filtering, 
wavelet smoothing, and a method for outlier detection. We derive a simple path algorithm for the 
minimization (pi) that applies to a general matrix D, hence this entire class of problems. Like the 
lasso, the generalized lasso solution is piecewise linear as a function of A. We also prove a result 
on the degrees of freedom of the fit for a general D. It is worth noting that problem (pi) has been 
considered by other authors, for example |28]. This last work establishes some asymptotic properties 
of the solution, and proposes a computational technique that relates to simulated annealing. 

The paper is organized as follows. We begin in Section p] by motivating the use of a penalty 
matrix D, offering several examples of problems that fit into this framework. Section [3] explains that 
some instances of the generalized lasso can be transformed into a regular lasso problem, but many 
instances cannot, which emphasizes the need for a new path approach. In Section |4] we derive the 
Lagrange dual of (pi), which serves as the jumping point for our algorithm and all of the work that 
follows. For the sake of clarity, we build up the algorithm over the next 3 sections. Sections [5] and 
[6] consider the case X = I. In Section [5] we assume that D is the 1-dimensional fused lasso matrix, 
in which case our path algorithm takes an especially simple (and intuitive) form. In Section [6] we 
give the path algorithm for a general penalty matrix D, which requires adding only one step in the 
iterative loop. Section [7] extends the algorithm to the case of a general design matrix X. Provided 
that X has full column rank, we show that our path algorithm still applies, by rewriting the dual 
problem in a more familiar form. We also outline a path approach for the case when X has rank 
less than its number of columns. Practical considerations for the path's computation are given in 
Section [1 

In Section [9] we focus on the lasso case, D = I, and compare our method to LARS. Above, 
we described LARS as an algorithm for computing the solution path of (fTl). This actually refers 
to LARS in its "lasso" state, and although this is probably the best- known version of LARS, it is 
not the only one. In its original (unmodified) state, LARS does not necessarily optimize the lasso 
criterion, but instead performs a more "democratic" form of forward variable selection. It turns out 
that with an easy modification, our algorithm gives this selection procedure exactly. In Section [T0| 
we derive an unbiased estimate of the degrees of freedom of the fit for a general matrix D. The 
proof is quite straightforward because it utilizes the dual fit, which is simply the projection onto a 
convex set. As we vary D, this result yields interpretable estimates of the degrees of freedom of the 



fused lasso, trend filtering, and more. Finally, Section 11 contains some discussion. 



To save space (and improve readability) , many of the technical details in the paper are deferred 
to a supplementary document, available online at http://www-stat . Stanford. edu/-ryaJitibs/[ 

2 Applications 

There are a wide variety of interesting applications of problem (pi) . What we present below is not 
meant to be an exhaustive list, but rather a set of illustrative examples that motivated our work on 
this problem in the first place. This section is split into two main parts: the case when X = I (often 
called the "signal approximation" case), and the case when X is a general design matrix. 



2.1 The signal approximation case, X = I 

When X — I, the solution of the lasso problem (fTl) is given by soft-thresholding the coordinates of 
y. Therefore one might think that an equally simple formula exists for the generalized lasso solution 
when the design matrix is the identity — but this is not true. Taking X = J in the generalized lasso 
(pi) gives an interesting and highly nontrivial class of problems. In this setup, we observe data y G R" 
which is a noisy realization of an underlying signal, and the rows oi D G R™^" reflect some believed 
structure or geometry in the signal. The solution of problem (pi) fits adaptively to the data while 
exhibiting some of these structural properties. We begin by looking at piecewise constant signals, 
and then address more complex features. 

2.1.1 The fused lasso 

Suppose that y follows a 1-dimensional structure, that is, the coordinates of y correspond to succes- 
sive positions on a straight line. If D is the (n — 1) x n matrix 



Did = 



-1 1 
-1 1 














-1 1 



(3) 



then problem (pi) penalizes the absolute differences in adjacent coordinates of /3, and is known as the 
Id fused lasso |32j . This gives a piecewise constant fit, and is used in settings where coordinates in the 
true model are closely related to their neighbors. A common application area is comparative genomic 
hybridization (CGH) data: here y measures the number of copies of each gene ordered linearly along 
the genome (actually y is the log ratio of the number of copies relative to a normal sample), and 
we believe for biological reasons that nearby genes will exhibit a similar copy number. Identifying 
abnormalities in copy number has become a valuable means of understanding the development of 
many human cancers. See Figure [T] for an example of the Id fused lasso applied to some CGH data 
on glioblastoma multiformes, a particular type of malignant brain tumor, taken from [5j. 
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Figure 1: The Id fused lasso applied to some glioblastoma multiforme data. The red line represents the 
inferred copy number from the Id fused lasso solution (for A — 3). 



A natural extension of this idea penalizes the differences between neighboring pixels in an image. 
Suppose that y represents a noisy image that has been unraveled into a vector, and each row of D 
again has a 1 and —1, but this time arranged to give both the horizontal and vertical differences 



between pixels. Then problem ^ is called the 2d fused lasso [35], and is used to denoise images 
that we believe should obey a piecewise constant structure. This technique is a special type of 
total variation denoising, a well-studied problem that carries a vast literature spanning the fields 
of statistics, computer science, electrical engineering, and others (for example, see [H]). Figure [2] 
shows the 2d fused lasso applied to a toy example. 
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(a) Original 



(b) Noisy 



(c) Denoised 



Figure 2: An example of the 2d fused lasso for image denoising. We started with a toy signal, shown in 
(a). The colors green, blue, purple, red m the image correspond to the numeric levels 1,2,3,4, respectively. 
We then added noise, shown m (h), interpolating between colors to display the intermediate values. This is 
used as the data y in the 2d fused lasso problem. The solution (for X = 1) is shown m (c), and it is a fairly 
accurate reconstruction. The fused lasso is effective here because the original image is piecewise constant. 



We can further extend this idea by defining adjacency according to an arbitrary graph structure, 
with n nodes and m edges. Now the coordinates oi y € K," correspond to nodes in the graph, and 
we penalize the difference between each pair of nodes joined by an edge. Hence D is m x n, with 
each row having a —1 and 1 in the appropriate spots, corresponding to an edge in the graph. In this 
case, we simply call problem ^ the fused lasso. Note that both the Id and 2d fused lasso problems 
are special cases of this, with the underlying graph a chain and a 2d grid, respectively. But the fused 
lasso is a very general problem, as it can be applied to any graph structure that exhibits a piecewise 
constant signal across adjacent nodes. See Figure |3] for application in which the underlying graph 
has US states as nodes, with two states joined by an edge if they share a border. This graph has 48 
nodes (we only include the mainland US states) and 105 edges. 

The observant reader may notice a discrepancy between the usual fused lasso definition and ours, 
as the fused lasso penalty typically includes an additional term ||/3||i, the £i norm of the coefficients 
themselves. We refer to this as the sparse fused lasso, and to represent this penalty we just append 
the n X n identity matrix to the rows of D. Actually, this carries over to all of the applications 
yet to be discussed — if we desire pure sparsity in addition to the structural behavior that is being 
encouraged by D, we append the identity matrix to the rows of D. 





(a) Data 



(b) Fused Lasso Solution 



Figure 3: An example of the fused lasso on an irregular graph. The data y are the log proportion of HlNl 
flu cases for each (mainland) US state in the year 2009, shown m (a). This was taken from /6/. The color 
map uses white to reflect the lowest measured log proportion, and dark red to reflect the highest, with yellow, 
orange, and red in between. We can think of the data as noisy measurements of the true log probabilities 
of infection in each state, which likely exhibits some geographic trend. Therefore, we solve the fused lasso 
problem on a custom underlying graph, where we connect two states by an edge if they share a border. Shown 
in (b) is the solution (for A = 0.25/ Here, groups of states are assigned the same color or "fused" on the west 
coast, m the mid west, in the south east, and in the north east. The colors suggest that, among these regions, 
you are most likely to get HlNl flu if you live m the north east, then the west coast, then the midwest, and 
then the south east. But there certainly are states that don't get fused into these regions, like Wisconsin and 
Illinois, where the infection rates are exceptionally high. 



2.1.2 Linear and polynomial trend filtering 

Suppose again that y follows a l-dimensional structure, but now D is the (n — 2) x n matrix 
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Then problem ([2]) is equivalent to linear trend fluttering (also called £i trend filtering) |21j . Just as 
the Id fused lasso penalizes the discrete first derivative, this technique penalizes the discrete second 
derivative, and so it gives a piecewise linear fit. This has many applications, namely, any settings 
in which the underlying trend is believed to be linear with (unknown) changepoints. Moreover, by 
recursively defining 

At,fe = -Did ■ Dti.k-i for fc = 2, 3, . . . , 



(here Did is the in — k — \) x (n — k) version of &) wc can fit a piecewise polynomial of any order 
k, further extending the realm of applications. Wc call this polynomial trend filtering of order k. 
Figure |4] shows examples of linear, quadratic, and cubic fits. 

The polynomial trend filtering fits (especially for fc = 3) are similar to those that one could obtain 
using regression splines and smoothing splines. However, the knots (changes in fcth derivative) 
in the trend filtering fits are selected adaptively based on the data, jointly with the inter-knot 
polynomial estimation. This phenomenon of simultaneous selection and estimation — analogous to 
that concerning the nonzero coefficients in the lasso fit, and the jumps in the piecewise constant 
fused lasso fit — does not occur in regression and smoothing splines. Regression splines operate on 
a fixed set of knots, and there is a substantial literature on knot placement for this problem (see 
Chapter 9.3 of |17j, for example). Smoothing splines place a knot at each data point, and implement 
smoothness via a generalized ridge regression on the coefficients in a natural spline basis. As a result 
(of this £2 shrinkage), they cannot represent both global smoothness and local wiggliness in a signal. 
On the other hand, trend filtering has the potential to represent both such features, a property 
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(a) Linear 



(b) Quadratic 
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Figure 4: Solutions of S for three problems, with D equal to (a) -Dtf,i, (b) -Dtt.2, cind (c) -Dtf,3- These 
are piece-wise linear, quadratic, and cubic, respectively. (For each problem we chose a different value of the 
regularization parameter X.) 

called "time and frequency localization" in the signal processing field, though this idea has been 
largely unexplored. The classic example of a procedure that allows time and frequency localization 
is wavelet smoothing, discussed next. 

2.1.3 Wavelet smoothing 

This is a quite a popular method in signal processing and compression. The main idea is to model 
the data as a sparse linear combination of wavelet functions. Perhaps the most common formulation 
for wavelet smoothing is SURE shrinkage [9, which solves the lasso optimization problem 



... 1,1 
minimize -\\v - 



we\\l + \\\ 



(4) 



where W G R"^" has an orthogonal wavelet basis along its columns. By orthogonality, we can 
change variables to /? = W9 and then Q becomes a generalized lasso problem with D ~ W^ . 

In many applications it is desirable to use an overcomplete wavelet set, so that W G R"^™ with 
n < m. Now problem Q and the generalized lasso ^ with D = W'^ (and X = I) are no longer 
equivalent, and in fact give quite different answers. In signal processing, the former is called the 
synthesis approach, and the latter the analysis approach, to wavelet smoothing. Though attention 
has traditionally been centered around synthesis, a recent paper by [12] suggests that synthesis may 
be too sensitive, and shows that it can be outperformed by its analysis counterpart. 

2.2 A general design matrix X 

For any of the fused lasso, trend filtering, or wavelet smoothing penalties discussed above, the 
addition of a general matrix X of covariates significantly extends the domain of applications. For a 
fused lasso example, suppose that each row of X represents a ki x k2 x k^ MRI image of a patient's 
brain, unraveled into a vector (so that p = ki ■ k^ ■ k^). Suppose that y contains soine continuous 
outcome on the patients, and we model these as a linear function of the MRIs, E{yi\Xi) = f3'^Xi. 
Now /3 also has the structure of a fci x fc2 x ^3 image, and by choosing the matrix D to give the sparse 
3d fused lasso penalty (that is, the fused lasso on a 3d grid with an additional £1 penalty of the 
coefficients) , the solution of ([2| attempts to explain the outcome with a small number of contiguous 
regions in the brain. 

As another example, the inclusion of a design matrix X in the trend filtering setup provides an 
alternative way of fitting varying- coefficient models ^8l[8j. We consider a data set from [18], which 



6 



examines n = 88 observations on the exhaust from an engine fueled by ethanol. The response y is 
the concentration of nitrogen dioxide, and the two predictors are a measure of the fuel-air ratio -E, 
and the compression ratio of the engine C . Studying the interactions between E and C leads the 
authors of 1 18 to consider the model 



^{yi\E,,C,) ^ ME^) + Pi{E,) ■ C 



(5) 



This is a linear model with a different intercept and slope for each Ei, subject to the (implicit) 
constraint that the intercept and slope should vary smoothly along the E'i's. We can fit this using 
(I2]) , in the following way: first we discretize the continuous observations i?i , . . . En so that they fall 
into, say, 25 bins. Our design matrix X is 88 x 50, with the first 25 columns modeling the intercept 
/3o and the last 25 modeling the slope /3i. The ith row of X is 



Xij — 



1 if Ei lies in the jth bin 

d if Ei lies in the {j + 25)th bin 

otherwise. 



D 



Af.3 





Finally, we choose 



Af,3 

where -Dtf,3 is the cubic trend filtering matrix (the choice i'tf.s is not crucial and of course can be 
replaced by a higher or lower order trend filtering matrix.) The matrix D is structured in this way so 
that we penalize the smoothness of the first 25 and last 25 components of /3 = (/Sg, PiY' individually. 
With X and D as described, solving the optimization problem (l2| gives the coefficients shown in 
Figure pi which appear quite similar to those plotted in |18j . 
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Figure 5: The intercept and slope of the varying- coefficient model ([5| for the engine data of }18f . fit using 
H with a cubic trend filtering penalty matrix (and X = 3). The dashed lines show 85% bootstrap confidence 
intervals from 500 bootstrap samples. 



We conclude this section with a generalized lasso application of i29i . in which the penalty is not 
structurally-based, unlike the examples discussed previously. Suppose that we observe j/i, . . . y„, and 
we believe the majority of these points follow a linear model E(j/j|Xi) = /3'^Xi for some covariates 
Xi — {Xii, . . . XipY^ , except that a small number of the j/j are outliers and do not come from this 
model. To determine which points are outliers, one might consider the problem 



minimize 



2" 



XPWl subject to \\z - vWq < k 



(6) 



for a fixed integer k. Here ||a;||o = J2i 1(^« 7^ 0). Thus by setting k = 3, for example, the solution 
z of ([6]) would indicate which 3 points should be considered outliers, in that Zi ^ yi for exactly 3 
coordinates. A natural convex relaxation of problem ([6]) is 

minimize -llz — Xi3|L + Allz — 7/111 , (7) 

where we have also transformed the problem from bound form to Lagrangian form. Letting a = y— z, 
this can be rewritten as 

minimize -||y — a — X/3||2 + A||a||i, (8) 



aGE",0GRP 2 



which fits into the form of problem pi, with design matrix X = [I X], coefficient vector /3 = (a, P^ 
and penalty matrix D = \I 0]. Figure p^ shows a simple example with p = 1. 
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Figure 6: A simple example of using problem, (pi to perform, outlier detection. Written in the form m^, the 
blue line denotes the fitted slope J3, while the red circles indicate the outliers, as determtned by the coordinates 
of a that are nonzero (for X — 8). 

After reading the examples in this section, a natural question is: when can a generalized lasso 
problem (pi) be transformed into a regular lasso problem (fll)? (Recall, for example, that this is 
possible for an orthogonal D, as we discussed in the wavelet smoothing example.) We discuss this 
in the next section. 



3 When does a generalized lasso problem reduce to a lasso 
problem? 

If D is p X p and invertible, we can transform variables in problem (pi) by 6* = D(3, yielding the lasso 
problem 

minimize -lly - XD-'^eil^ + XMU. (9) 

More generally, ii D is m x p and rank(£') — m (note that this necessarily means m < p), then we 
can still transform variables and get a lasso problem. First we construct a, p x p matrix D = . 



with rank(_D) = p, by finding a (p — m) x p matrix A whose rows are orthogonal to those in D. Then 
we change variables to 6* = {6i, 62)'^ = D(3, so that the generalized lasso ^ becomes 

... 1, 



minimize -\\y — XD 



ewi + MM 



(10) 



This is almost a regular lasso, except that the £1 penalty only covers part of the coefficient vector. 
First write XD^^O = Xi9i + X292; then, it is clear that at the solution the second block of the 
coefficients is given by a linear regression: 

92 = {X^X2)-'X^iy-XA). 



Therefore we can rewrite problem ( 10 1 as 



minimize 1\\{I - P)y - {I - P)Xi9i\\l + X\\ei\\i, 



(11) 



where P 



XI 



{XIX2) 



Xj, the projection onto the column space of X2- The LARS algorithm 



provides the solution path of such a lasso problem (11), from which we can back-transform to get 
the generalized lasso solution: /3 = D^^9. 

However, if D is m x p and rank(D) < m, then such a transformation is not possible, and LARS 
cannot be used to find the solution path of the generalized lasso problem ^. Further, in this case, 
the authors of [12] establish what they call an "unbridgeable" gap between problems (II]) and (l2|, 
based on the geometric properties of their solutions. 

While several of the examples from Section [2] satisfy rank(D) = m, and hence admit a lasso 
transformation, a good number also fall into the case rank(Z)) < m, and suggest the need for a novel 
path algorithm. These are summarized in Table IT] Therefore, in the next section, we derive the 
Lagrange dual of problem ^ , which leads to a nice algorithm to compute the solution path of (pi) 
for an arbitrary penalty matrix D. 



rank(Z?) = m 



• The Id fused lasso 

• Polynomial trend filtering of any order 

• Wavelet smoothing with an orthogonal 
wavelet basis 

• Outlier detection 



rank(Z?) < m 

• The fused lasso on any graph that has 
more edges than nodes (for example, the 
2d fused lasso) 

• The sparse fused lasso on any graph 

• Wavelet smoothing with an overcomplete 
wavelet set 



Table 1: Examples from Section\E that fall into the cases rank(D) — m and rank(D) < m. 



4 The Lagrange dual problem 

Roughly speaking, the generalized lasso problem ^ is difficult to analyze directly because the 
nondifferentiable (,1 penalty is composed with a linear transformation of /3. We hence turn to the 
corresponding Lagrange dual problem where the story is conceptually clearer. First we consider the 
generalized lasso in the signal approximation case, X = I: 



minimize -lly — /?IU 

;3GR" 2" "^ 



\\\D(i\\ 



(12) 



Following an argument of |21) . we rewrite this problem as 



minimize ;:||j/ ~ /3||2 + A||z||i subject to z = D(3. 



The Lagrangian is hence 



£(/3, z, u) = l\\y- /3||2 + A||z||i + u^{Dp - z), 



and to derive the dual problem, we minimize this over f3, z. The terms involving /3 are just a 
quadratic, and up to some constants (not depending on u) 



-L,, ^,,o T^„^\ -L 



IZlll -u^z 



mm(-\\y-l3\\l + u^Dp] ^ --\\y - D 

while 

min I A | 



T ||2 
2) 



if ||u||oo <A 

— oo otherwise. 



Therefore the dual problem of ( 12 1 is 



minimize -\\y — D ujlj subject to ||u||oo < A. (13) 



Immediately we can see that (13 1 has a "nice" constraint set, {u : Halloo < A}, which is simply a 



box, free of any linear transformation. It is also important to note the difference in dimension: the 



dual problem has a variable u G R™, whereas the original problem (12), called the primal problem, 
has a variable (3 e R" . 

When rank(Z)) < m, the dual problem is not strictly convex, and so it can have many solutions. 
On the other hand, the primal problem is always strictly convex and always has a unique solution. 
The primal problem is also strictly feasible (it has no constraints), and so strong duality holds 



(see Section 5.2 of [3]). Let us denote the solutions of the primal problem (12) and dual problem 



(13) by (3x and u\, respectively, to emphasize their dependence on A. By taking the gradient of 
the Lagrangian £{P,z,u) with respect to 13 and setting this equal to zero, we get the primal-dual 
relationship 

Px^y-D'^ux. (14) 

Taking the gradient of C{l3, z, u) with respect to z and setting this equal to zero, we find that each 
coordinate i = 1, . . . tti of the dual solution satisfies 

■{+A} ii{D^xh>0 
UA,^ e <( {-A} if {D^xh < (15) 

-A, A] if {DPxh^O. 

This last equation tells us that the dual coordinates that are equal to A in absolute value, 

B = {^■.\ux,^\^X}, (16) 

are the coordinates of D/3x that are "allowed" to be nonzero. But this does necessarily mean that 
{Dl3x)i ^ for all i e B. 

For a general design matrix X, we can apply a similar argument to derive the dual of (pi): 

minimize -{X'^y - D^u)'^ {X^ X)+{X^y - D'^u) (17) 

subject to ||u||oo < A, D'^u e tow{X), 
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This looks complicated, certainly in comparison to problem (13 1. However, the inequality constraint 
on u is still a simple (un-transformed) box. Moreover, we can make (171 look like (13) by changing 
the response y and penalty matrix D. This will be discussed later in Section [7] 

In the next two sections. Sections |5] and |6j we restrict our attention to the case X — I and 



derive an algorithm to find a solution path of the dual (13). This gives the desired primal solution 
path, using the relationship ( |14[ ). Since our focus is on solving the dual problem, we write simply 
"solution" or "solution path" to refer to the dual versions. Though we will eventually consider an 
arbitrary matrix D in Section l6l we begin by studying the Id fused lasso in Section [5] This case is 
especially simple, and we use it to build the framework for the path algorithm in the general D case. 

5 The Id fused lasso 

In this setting we have D = Did, the {n — 1) x n matrix given in ([3]). Now the dual problem (13) 
is strictly convex (since Di^ has rank equal to its number of rows), and therefore it has a unique 
solution. In order to efficiently compute the solution path, we use a lemma that allows us, at different 
stages, to reduce the dimension of the problem by one. 

5.1 The boundary lemma 

Consider the constraint set {u : |ju||oo < A} C Il"~^: this is a box centered around the origin with 
side length 2A. We say that coordinate i of u is "on the boundary" (of this box) if \ui\ — A. For the 
Id fused lasso, it turns out that coordinates of the solution that are on the boundary will remain on 
the boundary indefinitely as A decreases. This idea can be stated more precisely as follows: 

Lemma 1 (The boundary lemma). Suppose that D = I?id, the Id fused lasso matrix in ^. For 



any coordinate i, the solution ux of (13) satisfies 

u\o,i = Ao ^ uxA = A for all < X < Xq, 

and 

u\Q,i = — Ao ^ u\_i = —A for all < A < Aq. 

The proof is given in the online supplement. It is interesting to note a connection between the 
boundary lemma and a lemma of [14j . which states that 

K,. = /3ao,«+i ^ k^^ = k^i for all A > Ao (18) 

for this same problem. In other words, this lemma says that no two equal primal coordinates can 
become unequal with increasing A. In general \u\.i\ = A is not equivalent to {Df3\)i ^ 0, but these 
two statements are equivalent for the Id fused lasso problem (see the primal-dual correspondence in 



Section 5.3 1, and therefore the boundary lemma is equivalent to (18 1 



5.2 Path algorithm 

This section is intended to explain the path algorithm from a conceptual point of view, and no 



rigorous arguments for its correctness are made here. We defer these until Section 6.1 when we 
revisit the problem in the context of a general matrix D. 

The boundary lemma describes the behavior of the solution as A decreases, and therefore it is 
natural to construct the solution path by moving the parameter from A = oo to A = 0. As will be 
made apparent from the details of the algorithm, the solution path is a piecewise linear function of 
A, with a change in slope occurring whenever one of its coordinate paths hits the boundary. The key 
observation is that, by the boundary lemma, if a coordinate hits the boundary it will stay on the 
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boundary for the rest of the path down to A = 0. Hence when it hits the boundary we can essentially 
eliminate this coordinate from consideration (since we know its value at each smaller A), recompute 
the slopes of the other coordinate paths, and move until another coordinate hits the boundary. 

As we construct the path, we maintain two lists: B ~ 13{X), which contains the coordinates 
that are currently on the boundary; and s — s(A), which contains their signs. For example, if we 
have B{X) = (5,2) and s(A) = (—1,1), then this means that ua,5 = —A and ux,2 = A. We call the 
coordinates in B the "boundary coordinates" , and the rest the "interior coordinates" . Now we can 
describe the algorithm: 

Algorithm 1 (Dual path algorithm for the Id fused lasso). 

• Start with Aq = oo, B = $, and s — $. 

• Fork = 0, ...?l-2.• 



i. Compute the solution at Xk by least squares, as in (20 1. 



2. Continuing in a linear direction from the solution, compute Afc+i, when an interior coor- 



dinate will next hit the boundary, as in (21) and (22 I 



3. Add this coordinate to B and its sign to s. 

The algorithm's details appear slightly more complicated, but this is only because of notation. 
If i3 = (zi, . . . ifc), then we define for a matrix A and a vector x 



Ar- 



A, 



A,. 



and xg = (xij. 



where Ai is the ith row of A. In words: Ag gives the rows of A that arc in B, and xb gives 
the coordinates of x in B. We use the subscript —B, as in A_e oi' x^jg, to index over the rows 
or coordinates except those in B. Note that B as defined above (in the paragraph preceding the 
algorithm) is consistent with our previous definition (16), except that here we treat B as an ordered 
list instead of a set (its ordering only needs to be consistent with that of s). Also, we treat s as a 
vector when convenient. 

When A = oo, the problem is unconstrained, and so clearly B = $ and s ~ 0. But more generally, 
suppose that we are at the fcth iteration, with boundary set B = B{Xk) and signs s = s(Afc). By the 
boundary lemma, the solution satisfies 

ux,B = Xs for all A £ [0,Xk]- 



Therefore, for A < A^, we can reduce the optimization problem (13 1 to 
... 1, 



minimize -\\y - 
M_e 2 



HDrsY 



{D^Bfu^sWl 



subject to ||ii-elloo < A, 



(19) 



which involves solving for just the interior coordinates. By construction, ua^.-b lies strictly between 
— Afe and A^ in every coordinate. Therefore it is found by simply minimizing the objective function 



in (19), which gives the least squares estimate 



Xk{D^ 



(20) 



Now let a — Xkb denote the right-hand side above. For A < A^, the interior solution will continue to 
be ua,-b = a — Xb until one of its coordinates hits the boundary. This critical value is determined 
by solving, for each i, the equation a^ — Xbi — ±A; a simple calculation shows that the solution is 



t,.= 



{D^b{D-bV) 'd^bV 



b,±l 



{D_b{D-bV) 'd_b{DbVs 



±1 



(21) 
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(only one of +1 or —1 will yield a value ti G [0, Afc]), which we call the "hitting time" of coordinate 
i. We take A^+i to be maximum of these hitting times 



Afe+i = max ti 



(22) 



Then we compute 

ik+i = argmax U and Sk+i = sign(wAfc_^i,jfc+i), 

i 

and append ik+i and Sk+i to B and s, respectively. 

5.3 Properties of the solution path 

Here we study some of the path's basic properties. Again we defer any rigorous arguments until 
Section [6. 2 [ when we consider a general penalty matrix D. Instead we demonstrate them by way of 
a simple example. 

Consider Figure 7(a) which shows the coordinate paths u\^i for an example with n = 8. Recall 
that it is natural to interpret the paths from right to left (A = c» to A = 0). Initially all of the slopes 
are zero, because when A = oo the solution is just the least squares estimate {DD'^)^^Dy, which 
has no dependence on A. When a coordinate path first hits the boundary (the topmost path, drawn 
in red) the slopes of the other paths change, and they don't change again until another coordinate 
hits the boundary (the bottommost path, drawn in green), and so on, until all coordinates are on 
the boundary. 



d 




E 



\ 

/ 



0.0 0.2 0.4 0.6 0.8 1.0 
% 
(b) Primal 

Figure 7: Dual and primal coordinate paths for a small problem with n = 8. 

The picture suggests that the path u\ is continuous and piecewise linear with respect to A, with 
changes in slope or "kinks" at the values Ai, . . . A„_i visited by the algorithm. (Piecewise linearity 
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is obvious from the algorithm's construction of the path, but continuity is not.) This is also true in 
the general D case, although the solution path can have more than m kinks for anm x n matrix D. 
On the other hand. Figure [7(b) | shows the corresponding primal coordinate paths 

^x,i = {y- D'^u\)i- 

As Ux is a continuous piecewise linear function of A, so is /3a, again with kinks at Ai, . . . A„_i. In 
contrast to the dual versions, it is natural to interpret the primal coordinate paths from left to right, 
because in this direction the coordinate paths become adjoined, or "fused" , at some values of A. The 
primal picture suggests that these fusion values are the same as the kinks Ai, . . . A„_i, that is: 

• Primal-dual correspondence for the Id fused lasso. The values of A at which two primal 
coordinates fuse are exactly the values of A at which a dual coordinate hits the boundary. 

A similar property holds for the fused lasso on an arbitrary graph, though the primal-dual corre- 
spondence is a little more complicated for this case. 



Note that as A decreases in Figure 7(a) no dual coordinate paths leave the boundary. This is 
prescribed by the boundary lemma. As A increases in Figure [7(b)[ no primal two coordinates split 



apart, or "un-fuse". This is prescribed by a lemma of [H] that we paraphrased in (|T8|), and the two 
lemmas are equivalent. 

6 A general penalty matrix D 



Now we consider ( |13[ ) for general m x n matrix D. The first question that comes to mind is: does 
the boundary lemma still hold? If DD^ is diagonally dominant, that is 

(DD^),, > Y, I iDD'^),j I for i = 1, . . . m, (23) 

then indeed the boundary lemma is still true. (See the online supplement.) Therefore the path 
algorithm for such a D is the same as that presented in the previous section. Algorithm [l] 

It is easy to check the Id fused lasso matrix is diagonally dominant, as both the left- and right- 



hand sides of the inequality in (23) are equal to 2 when D = Di^. Unfortunately, neither the 2d 



fused lasso matrix nor any of the trend filtering matrices satisfy condition (23 1. In fact, examples 
show that the boundary lemma does not hold for these cases. However, inspired by the Id fused 
lasso, we can develop a similar strategy to compute the full solution path for an arbitrary matrix D. 
The difference is: in addition to checking when coordinates will hit the boundary, we have to check 
when coordinates will leave the boundary as well. 

6.1 Path algorithm 

Recall that we defined, at a particular A^, the "hitting time" of an interior coordinate path to the 
value of A < Afe at which this path hits the boundary. Similarly, let us define the "leaving time" 
of a boundary coordinate path to be the value of A < Afc at which this path leaves the boundary 
(we will make this idea more precise shortly). We call the coordinate with the largest hitting time 
the "hitting coordinate", and the one with the largest leaving time the "leaving coordinate". As 
before, we maintain a list B of boundary coordinates, and s contains their signs. The algorithm for 
a general matrix D is: 

Algorithm 2 (Dual path algorithm for a general D). 

• Start with k — 0, Xq — cx), B — 0, and s = 0. 
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• While Afc > 0; 

1. Compute a solution at Afc by least squares, as in (261 



2. Compute the next hitting time ft.fc+i, as in (27) and (28). 

3. Compute the next leaving time Ik+i, as in (29), (30 1, and (31) 
4- Set Afc-|_i 



inax{/ifc+i, Zfc+i}. If hk+i > Ik+i then add the hitting coordinate to B and 
its sign to s, otherwise remove the leaving coordinate from B and its sign from s. Set 
k = k + \. 

Although the intuition for this algorithm comes from the Id fused lasso problem, its details 
are derived from a more technical point of view, via the Karush-Kuhn- Tucker (KKT) optimality 
conditions. For our problem (13 1, the KKT conditions are 



DD^ux-Dy + aj = Q, 



where u\,a,j are subject to the constraints 



||wa||oo < A 
a >0 

a-(lltiAlU-A)=0 
Il7lli<l 

j'^Ux = ||ua||oo- 



(24) 



(25a) 
(25b) 
(25c) 
(25d) 
(25e) 



Constraints ( |25d[ ), (25el say that 7 is a subgradient of the function a; 1— )■ ||a;||oo evaluated a.t x = u\. 
Subgradients are a generalization of gradients to the case of nondifferentiable functions — for an 
overview, see [5]. 



A necessary and sufficient condition for ux to be a solution to (13) is that ux,ct,l satisfy (24) 



and (25a)-(25e) for some a and 7. The basic idea is that hitting times are events in which (25a) 



is violated, and leaving times are events in which (25b)-(25e) are violated. We now describe what 
happens at the fcth iteration. At A = Aj., the path's solution is ux^.B — ^^fe* for the boundary 
coordinates, and the least squares estimate 



u\. 



-B 



= {D^BiD- 



B) 



\T\ + 



D^siy-XkiDefs) 



(26) 



for the interior coordinates. Here A'^ denotes the (Moore-Penrose) pseudoinverse of a matrix A, 
which is needed as D may not have full row rank. Write uXk,-B = a — Xkb. Like the Id fused lasso 
case, we decrease A and continue in a linear direction from the interior solution at Afc, proposing 
ux-B ^ a — Xb. We first determine when a coordinate of a — A6 will hit the boundary. Setting 
Oi— > Xbi — ±A and solving for A gives the hitting time of the ith interior coordinate. 



.(hit) 



[D^B{D^BVYD^By 



b,±l 



{D-B{D-BVyD_B{DBVs 



±1 



(only one of +1 or —1 will yield a value in [0, Afc].) Hence the next hitting time is 



hk+i = max t] 



(hit) 



(27) 



(28) 



The new step is to determine when a boundary coordinate will next leave the boundary. After ex- 



amining the constraints (25b)-(25e), we can express the leaving time of the zth boundary coordinate 



15 



by first defining 



Cj — Si ' 



and tficn tlic leaving time is 



Tlierefore the next leaving time is 



Db 


I-{D_Bf{D_BiD-Bf)^D_B 


y 


i 


Db 


I-{D-Bf{D_BiD-Bf)^D_B 


{DBfs 


(leave) J Ci/di if Ci < and di < 


^i 


1 otherwise. 







(29) 



lk-\-i = max t 



(leave) 



(30) 



(31) 



The last step of the iteration moves until the next critical event — hitting time or leaving time, 
whichever happens first. We can verify that the path visited by the algorithm satisfies the KKT 



conditions (24 1 and (25aH(25e) at each A, and hence is indeed a solution path of the dual problem 



(13|. This argument, as well a derivation of the leaving times given in (29l and (30), can be found 



in the online supplement. 



6.2 Properties of the solution path 

Suppose that the algorithm terminates after T iterations. By construction, the returned solution 
path u\ is piecewise linear with respect to A, with kinks at Ai > . . . > Ay. Continuity, on the other 



hand, is a little more subtle: because of the specific choice of the pseudoinverse solution in (26), the 



path ux is continuous over A. (When A does not have full column rank, there are many minimizers 
of \\z — Ax\\2, and x = {A'^ A)'^ A'^ z is only one of them.) The proof of continuity appears in the 
online supplement. 



Since the primal solution path /3x can be recovered from ux by the linear transformation (14), 
the path /3x is also continuous and piecewise linear in A. The kinks in this path are necessarily a 
subset of {Ai, . . . \t}- However, this could be a strict inclusion as rank(£') could be < m, that is, 
D^ could have a nontrivial null space. So when does the primal solution path change slope? To 
answer this question, it helps to write the solutions in a more explicit form. 

For any given A, let B — B{X) and s = s(A) be the current boundary coordinates and their signs. 
Then we know that the dual solution can be written as 



ux.B = As 

Ux.-B^{D-BiD-Bf)^D^B{y 



\[D„ 



This means that the dual fit D^ux is just 



D^ux = {DBfuxM + {D-Bfux,-B = \{DBfs + P,,^^D_^-^{y-\{DBfs), 



(32) 



where Pm denotes the projection operator onto a linear subspace M (here the row space of D-b)- 
Therefore, applying (14), the primal solution is given by 



/3a = (/-Prow(D_H))(y-A(^e)^s) = PnnHD_^){v-KDBfs). 



(33) 



Equation (33) is useful for several reasons. Later, in Section 10 we use it along with a geometric 



argument to prove a result on the degrees of freedom of (ix- But first, equation (33) can be used 
to answer our immediate question about the primal path's changes in slope: it turns out that j3x 
changes slope at Afe+i if mx\\{D _B(\k)) ¥" null(^-B(Afc+i)), that is, the null space of D^b changes 
from iterations fc to fc + 1. (The proof is left to the online supplement.) Thus we have achieved a 
generalization of the primal-dual correspondence of Section |5.3| 
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• Primal-dual correspondence for a general D. The values of A at which at which the primal 
coordinates changes slope are the values of A at which the null space of -D_b(a) changes. 

For various applications, the null space of I?_b can have a nice interpretation. We present the 
case for the fused lasso on an arbitrary graph Q, with m edges and n nodes. We assume without 
a loss of generality that Q is connected (otherwise the problem decouples into smaller fused lasso 
problems). Recall that in this setting each row of D gives the difference between two nodes connected 
by an edge. Hence the null space of D is spanned by the vector of all ones 

l = (l,l,...l)'^eR". 

Furthermore, removing a subset of the rows, as in D^js, is like removing the corresponding subset 
of edges, yielding a subgraph Q^b. It is not hard to sec that the dimension of the null space of Z)_e 
is equal to the number of connected components in G-b- tr fact, if G-b has connected components 
Ai, . . . Ak, then the null space of D_e is spanned by 1^^, . . . Ia^. G K,™, the indicator vectors on 
these components, 

(1a Jj = l(node j e A.^) for j = l,...n. 

When G-B has connected components Ai, . . . A^, the projection -Pnuii(Z)_B) performs a coordinate- 
wise average within each group Ai: 



Pnun(._.)(x) = E(%V^)-l-- 



Therefore, recalling (33), we see that coordinates of the primal solution /3x are constant (or in other 
words, fused) on each group A^. 

As A decreases, the boundary set B can both grow and shrink in size; this corresponds to adding 
an edge to and removing an edge from the graph G-B: respectively. Since the null space of D^b can 
only change when G-b undergoes a change in connectivity, the general primal-dual correspondence 
stated above becomes: 

• Primal-dual correspondence for the fused lasso on a graph. In two parts: 

(i) the values of A at which two primal coordinate groups fuse are the values of A at which 
a dual coordinate hits the boundary and disconnects the graph G-b(\)j 

(ii) the values of A at which two primal coordinate groups un-fuse are the values of A at which 
a dual coordinate leaves the boundary and reconnects the graph G^b(\)- 

Figure [8] illustrates this correspondence for a graph with n — 6 nodes and m = 9 edges. Note 
that the primal-dual correspondence for the fused lasso on a graph, as stated above, is consistent 
with that given in Section [573] This is because the Id fused lasso corresponds to a chain graph, so 
removing an edge always disconnects the graph, and furthermore, no dual coordinates ever leave the 
boundary by the boundary lemma. 

7 A general design matrix X 

In the last two sections, we focused on the signal approximation case X = I . In this section we 
consider the problem (pi) when X is a general n x p matrix of predictors (and I? is a general m x p 



penalty matrix). Our strategy is to again solve the equivalent dual problem (17). At first glance 



this problem looks much more difficult than the dual ( 13 ) when X ^ I. Moreover, the relationship 
between the primal and dual solutions is now 

px--iX^X)+{X^y^D^ux) + b, (34) 
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Figure 8: Dual and primal coordinate paths for the fused lasso applied to the shown graph structure. As A 
decreases, the first dual coordinate to hit the boundary is 119, but removing the corresponding edge doesn't 
disconnect the graph, so nothing happens in the primal setting. Then u% hits the boundary, and again, 
removing its edge doesn't affect the graph 's connectivity, so nothing happens. But when U5 hits the boundary 
next, removing its edge disconnects the graph (the node marked Ps becomes its own connected component), 
and hence two primal coordinate paths fuse. Note that us leaves the boundary at some point (the red dashed 
vertical line). Adding its edge reconnects the graph, and therefore two primal coordinates un-fuse. 
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where b G null(X), which is also more complicated. 



It turns out that we can make (17) look more familiar — that is, more like (13) — by defining 

y = XX+y and D = DX+. 



(35) 



(Here the pseudoinverse of a rectangular matrix A is A'^ = {A'^ A)^ A^ .) Abbreviating P = Pcoi{x) 
XX~^, the objective function in (17) becomes 



{X'^y - D'^uf{X'^X)+{X^y - D^u) = y^ Py - 2y'^b^u + u^DD'^u 

= {y-b'^ufP{y~b^u) 
= {y-b^ufp\y~b'^u) 
= (y-D^ufiy-D'^u). 

The first equality above is by the definition of D; the second holds because PD^ = D^\ the third 
is because P is idempotent; and the fourth is again due to the identity PD^ = D^ . Therefore we 



(36) 



can rewrite the dual problem (17) in terms of our transformed data and penalty matrix: 

minimize -\\y ^ D wjlj 
subject to ||w||oo < A, D^u e row(X), 
It is also helpful to rewrite the relationship ( [34| in terms of our new data and penalty matrix: 

p^^X+{y-b^u^) + h, (37) 

where h G null(X), which implies that the fit is simply 

XPx^y-b'^ux- (38) 



Modulo the row space constraint, D^u E row(X), problem (36 1 has exactly the same form as 
the dual ( 13 ) studied in Section l6] In the case that X has full column rank, this extra constraint 
has no effect, so we can treat the problem just as before. We discuss this next. 

7.1 The case rank(X) = p 

Suppose that rank(X) = p, so row(X) = W (note that this necessarily means p < n). Hence we 
can remove the constraint D"'"u e row(_X) from problem (36), so that it is really just the same as 
problem (13) that we solved in Section pi except with y,D replaced by y,D, respectively. Therefore 
we can apply Algorithm p^ to find a dual solution path u\. This gives the (unique) primal solution 
path using (37) with 6 = (since null(X) = {0}), or the fit using (38). 

Fortunately, all of the properties in Section [6?2] apply to the current setting as well. First, we 
know that the constructed dual path ux is continuous and piecewise linear, because we are using the 
same algorithm as before, just with a different data vector and penalty matrix. This means that /3\ 



is also continuous and piecewise linear, since it is given by the linear transformation (37). Next, we 
can follow the same logic in writing out the dual fit D"^u\ to conclude that 



/3a 



X+P 



lii(d_b; 



{y~\{D^r3fs), 



or 



^/^A-P,n(5_,)(y-A(i?_^,)^.). 
Hence — D^isXj3\ = D^^Px, which means that /3a G nul^i^-B), as before. 



(39) 



(40) 
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Though working with equations (391 and (40) may seem comphcated (as one would need to 



expand the newly defined variables y,D in terms of y,D), it is straightforward to show that the 
general primal-dual correspondence still holds here. This is given in the online supplement. That is: 
the primal path /3x changes slope at the values of A at which the null space of D_t}(^x) changes. For 
the fused lasso on a graph Q^ we indeed still get fused groups of coordinates in the primal solution, 
since j3\ e null(Z?_e) implies that Px is fused on the connected components of Q-b- Therefore, 
fusions still correspond to dual coordinates hitting the boundary and disconnecting the graph, and 
un-fusions still correspond to dual coordinates leaving the boundary and reconnecting the graph. 



7.2 The case rank(X) < p 

If rank(X) < p, then row(X) is a strict subspace of R''. One easy way to avoid dealing with the 
constraint D'^u G row(X) of (361 is to add an ^2 penalty to our original problem, giving 



1, 



mmimize - ry 



Xp\\i + \\\Dp\\,+em\i (41) 

where e > is fixed. This is analogous to the elastic net, which adds an £2 penalty to the lasso 



criterion 



Problem (41) can be rewritten as 
... 1 



minimize -\\y* - iX*)^ + M\Dl3\\u 



where y* — {y, 0)"^ and X* = , . Since rank(X*) = p, we can now use the strategy discussed 

in the last section, which is just applying Algorithm [2] to a transformed problem, to find the solution 
path of (41). Putting aside computational concerns, it may still be preferable to study problem (41 ) 
instead of the original generalized lasso problem ^. Some reasons are: 



the (unique) solution path of (41) may be more stable than the (non- unique) generalized lasso 



path, analogous to the behavior of the elastic net and lasso paths; 



the fit of (41 ) may actually outperform the generalized lasso fit in terms prediction error, again 



like the comparison between the elastic net and lasso prediction errors. 

Though adding an £2 penalty is easier and, as we suggested, perhaps even desirable, we can still 
solve the unmodified problem ^ in the rank(X) < p case, by looking at its dual (36). We only give 
a rough sketch of the path algorithm because in the present setting the solution and its computation 
are more complicated. 

We can rewrite the row space constraint in (36) as D^u _L null(X). Write P = Pnnii{X}j ^nd 
then problem ( 36 ) is 



minimize -\\y — D ujjj 

uSR™ 2 



(42) 



subject to 



< A, [DPfu = 0. 



To find a solution path of (42), the KKT conditions (24) and (25a)-(25e) need to be modified to 



incorporate the new equality constraint. These are now 

bb'^ux -Dy + a-f + DPS = 0, 



where ux, ct, 7, 5 are subject to the same constraints as before, (25a)-(25e), and additionally {DP)^ux 
0. Instead of simply using the appropriate least squares estimate at each iteration, we now need to 
solve for ux and 5 jointly. When A = 00, this can be done by solving the block system 



DD'^ 

{DPV 



DP 





Ux 
6 



Dy 




(43) 
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and in future iterations the expressions are similar. Having done this, satisfying the rest of the 



constraints ( 25a l~( 25e I can be done by finding the hitting and leaving times just as we did previously. 



8 Computational considerations 

Here we discuss an efhcient implementation of Algorithm [2J which gives the solution path of the 
signal approximation problem ( |12| , after applying the transformation (14 1 from dual to primal 
variables. For a design with rank(X) — p, we can modify y and X as in (35), and then the same 
algorithm gives the solution path of (l2]), this time relying on the transformation (37) (with 6 = 0) 
for the primal 

At each iteration of the algorithm, the dominant work is in computing expressions of the form 



for a vector x £ R", where B is the current boundary set (see equations (27) and (29)). Equivalently, 
the complexity of each iteration is based on finding 



argmin < ||f II2 : v £ argmin ||a; — {D-b) 



W\\2 



(44) 



the least squares solution with the smallest ^2 norm. In the next iteration, D^b has either one less 
or one more row (depending on whether a coordinate hit or left the boundary). 

We can exploit the fact that the problems ( 44 ) are highly related from one iteration to the next 
(our strategy that is similar to that in the LARS implementation). Suppose that when B = 0, we 
solve the problem (44) by using a matrix factorization (for example, a QR decomposition). In future 



iterations, this factorization can be efficiently updated after a row has been deleted from or added 
to D^B- This allows us to compute the new solution of (44) with much less work than it would take 
to solve the problem from "scratch" . 

Recall that Z? is m x n, and the dual variable u is ?7i-dimensional. Let T denote the number of 
iterations taken by the algorithm (note that T > m, and can be strictly greater if dual coordinates 
leave the boundary). When m < n, we can use a QR decomposition of Z?^ to compute the full dual 
solution path in 



0{t 



Tm^ 



operations. When m > n, using a QR decomposition of D allows us to compute the full dual solution 
path in 

operations. The main idea behind this implementation is fairly straightforward. However, the 



details are somewhat complicated because we require the minimum I2 norm solution ( 44 1 , instead 



of a generic solution, to the appropriate least squares problem at each iteration. See Chapters 5 and 
12 of [15j| for an extensive coverage of the QR decomposition. 
We mention two simple points to improve practical efficiency: 

• The algorithm starts at the fully regularized end of the path (A = cx)) and works towards the 
un-regularized solution (A = 0). Therefore, for problems in which the highly or moderately 
regularized solutions are the only ones of interest, the algorithm can compute only part of the 
path and terminate early. This could end up being a large savings in practice. 

• One can obtain an approximate solution path by not permitting dual coordinates to leave the 
boundary (achieved by setting l^+i == in Step 3 of Algorithmic]). This makes T — m, and so 
computing this approximate path only requires 0{mn^) or 0{m?n) operations when m < n 
OT m > n, respectively. This approximation can be quite accurate if the number times a dual 
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coordinate leaves the boundary is (relatively) small. Furthermore, its legitimacy is supported 
by the following fact: for D = J (the lasso problem) and rank(X) = p, this approximate path 
is exactly the LARS path when LARS is run in its original (unmodified) state. We discuss 
this in the next section. 

Finally, it is important to note that if one's goal is to find the solution of ( f 2 1 or (l2]) over a discrete 
set of A values, and the problem size is very large, then it is unlikely that our path algorithm is the 
most efficient approach. A reason here is the same reason that LARS is not generally used to solve 
large-scale lasso problems: the set of critical points (changes in slope) in the piecewise linear solution 
path /3x becomes very dense as the problem size increases. But there is a further disadvantage is 
that is specific to the generalized lasso (and not the lasso) . At the regularized end of the path (which 
is typically the region of interest), the generalized lasso solution is orthogonal to many of the rows 



of D; therefore we must solve large least squares problems, as in (26 1, in order to compute the path. 
On the other hand, at the regularized end of the lasso path, the solution lies in the span of a small 
number of columns of X; therefore we only need to solve small least squares problems to compute 
the path. 

For solving a large generalized lasso (or lasso) problem at a fixed A, it is preferable to use a 
convex optimization technique that was specifically developed for the purposes of computational 
efficiency. First-order methods, for example, can efficiently solve large-scale instances of (12) or ^ 
for A in a discrete set (see |I] as an example). Another optimization method of recent interest is 
coordinate descent [^ , which is quite efficient in solving the lasso at discrete values of A [T3] , and is 
favored for its simplicity. But coordinate descent cannot be used for the minimizations (12) and ([2|, 



because the penalty term ||-D/3||i is not separable in /?, and therefore coordinate descent does not 



necessarily converge. In the important signal approximation case (12), however, the dual problem 



(13) is separable, so coordinate descent will indeed converge if applied to the dual. Moreover, for 



various applications, the matrix D is sparse and structured, which means that the coordinate-wise 



updates for (13) are very fast. This makes coordinate descent on the dual a promising method for 



solving many of the signal approximation problems from Section 2.1 



9 Connection to LARS 

In this section, we return to the LARS algorithm, described in the introduction as a point of 
motivation for our work. Here we assume that D — I, so that (pi) is the same as the standard lasso 
problem (fTl); furthermore we assume that rank(X) = p. Our algorithm computes the lasso path /3a, 
via the dual path ux (using Algorithm pi applied to y, D as defined in (35 )). Another way of finding 



the lasso path is to use the LARS algorithm in its "lasso" mode. Since the problem is strictly convex 
{X has full column rank), there is only one solution at each A, so of course these two algorithms 
must give the same result. 

In its original or unmodified state, LARS returns a different path, obtained by selecting variables 
in order to continuously decrease the maximal absolute correlation with the residual (technically 
these are inner products with the residual, but they become correlations if we think of standardized 
data). We refer to this as the "LARS path". Interestingly, the LARS path can be viewed as an 
approximation to the lasso path (see [11] for an elegant interpretation and discussion of this). In our 
framework, we can obtain an approximate dual solution path if we never check for dual coordinates 
leaving the boundary, which can be achieved by dropping Step 3 from Algorithm p] (or more precisely, 
by setting Ik+i — for each k). If we denote the resulting dual path by u\, then this suggests a 
primal path 

^x = {X^Xy^iX^y-ux), (45) 



based on the transformation ( [34|) (noting that & = as null(A") — {0}). The question is: how does 
this approximate solution path /3x compare to the LARS path? 
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Figure [9] shows the two paths in question. On the left is the famihar plot of [TT] , showing the 
LARS path for the "diabetes data" . The colored dots on the x-axis mark when variables enter the 
model. The right plot shows our approximate solution path on this same data set, with vertical 
dashed lines marking when variables (coordinates) hit the boundary. The paths look identical, and 
this is not a coincidence: we show that our approximate path, which is given by ignoring dual 
coordinates leaving the boundary, is equal to the LARS path in general. 
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Figure 9: Companng the LARS path and our approximate lasso path, on the diabetes data. For this data 
set n — 442 and p — 10. The paths by parametrized by the l\ norm of their (respective) coefficient vectors. 



Lemma 2 (Equivalence to LARS). Suppose that rank(A) = p and consider using Algorithmic 
to compute an approximate lasso path in the following way: we use y = XX^y, D = X^ in place of 
J/, D, and we ignore Step 3 (that is, set Ik+i = Oj. Let u\ denote the corresponding dual path, and 



define a primal path j3\ according to (45 1. Then j3\ is exactly the LARS path. 



Proof. First define the residual rx — y — Xf3x. Notice that by rearranging (45 1, we get ux = X^ rx- 



Therefore, the coordinates of the dual path are equal to the correlations (inner products) of the 
columns of X with the current residual. Hence we have a procedure that: 

• moves in a direction so that the absolute correlation with the current residual is constant 
within B (and maximal among all variables) for all A; 

• adds variables to B once their absolute correlation with the residual matches that realized in 
B. 

This almost proves that (5x is the LARS path, with B being the "active set" in LARS terminology. 
What remains to be shown is that the variables not in B are all assigned zero coefficients. But, 
recalling that D = L, the same arguments given in Section |6.2| and Section |7.1| apply here to give 
that /3x G null(/_B) (really, ux still solves a sequence of least squares problems, and the only 
difference between ux and ux is in how we construct B). This means that /3a, -B = 0, as desired. D 
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10 Degrees of freedom 

In applied statistics, degrees of freedom describes the effective number of parameters used by a 
fitting procedure. This is typically easy to compute for linear procedures (linear in the data y) but 
difhcult for nonlinear, adaptive procedures. In this section, we derive the degrees of freedom of the 
fit of the generalized lasso problem, when rank(X) = p and D is an arbitrary penalty matrix. This 
produces corollaries on degrees of freedom for various problems presented in Section [2] We then 
briefly discuss model selection using these degrees of freedom results, and lastly we discuss the role 
of shrinkage, a fundamental property of £i regularization. 

10.1 Degrees of freedom results 

We assume that the data vector y e K," is drawn from the normal model 

y^N{^i,a'I), 

and the design matrix X is fixed (nonrandom). For a function g : R" -^ R", with ith coordinate 
function gi : R" — > R, the degrees of freedom of g is defined as 



n 



i=l 



For our problem, the function of interest is g{y) = Xf3\{y), for fixed A. 

An alternative and convenient formula for degrees of freedom comes from Stein's unbiased risk 
estimate [30] . li g is continuous and almost differentiable, then Stein's formula states that 

n 

- J2 Cov(g,(y), y,) = E[(V • g)iy)]. (46) 

i— 1 

Here V ■ g — ^"^^ dgi/dyi is called the divergence of 6. This is useful because typically the right- 
hand side of ( [46| is easier to calculate; for our problem this is the case. But using Stein's formula 
requires checking that the function is continuous and almost differentiable. In addition to checking 
these regularity conditions for g{y) = X/3\{y), we establish below that for almost every y the fit 



Xl3x{y) is a locally afhne projection. Essentially, this allows us to take the divergence in (33) when 



X = I, or (40 1 for the general X case, and treat B and s as constants. 

As in our development of the path algorithm in Sections [5J |6J and [7J we first consider the case 
X = I, because it is easier to understand. Notice that we can express the dual fit as D^u\{y) — 
Pcx{y)i the projection of y onto the convex polytope 

Ca = {D^u : hlU < A} C R". 



From (14 1, the primal solution is just the residual from this projection, I3\{y) = (/ — Pc^){y)- 
The projection map onto a convex set is always nonexpansive, that is, Lipschitz continuous with 
constant < 1. In fact, so is the residual from projecting onto a convex set (for example, see the 
proof of Theorem 1.2.2 in |27j). Therefore (3x{y) is nonexpansive in y, and hence continuous and 
almost differentiable (this follows from the standard proof a result called "Rademacher's theorem" ; 
for example, see Theorem 2 in Section 3.2 of [13 ). 

Furthermore, thinking geometrically about the projection map onto C\ yields a crucial insight. 
Examine Figure [TO] — as drawn, it is clear that we can move the point y slightly and it still projects 
to the same face of Cx. In fact, it seems that the only points y for which this property does not hold 
necessarily lie on rays that emanate orthogonally from the corners of C'x (two such rays are drawn 
leaving the bottom right corner). In other words, we are lead to believe that for almost every y, the 
projection map onto Cx is a locally constant affine projection. This is indeed true: 
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Lemma 3. For fixed X, there exists a set Af\ such that: 

(a) A/a has Hausdorjf dimension n — 1, hence Lehesgue measure zero; 

(b) for any y ^ J\f\ , there exists a neighborhood U of y such that Pc^ '■ U — > R" is simply the 
projection onto an affine subspace. In particular, the affine subspace is 



X{DBfs + TOw{D^B), 



(47) 



where B and s are the boundary set and signs for a solution ux{y) of the dual problem (13), 
B ^ {i : \ux.,{y)\ = A} and s = sigii(uA,e(j/))- 



The quantity (47) is well-defined in the sense that it is invariant under different choices of B 



and s (as the dual solution may not be unique). 
The proof, which follows the intuition described above, is given in the online supplement. 



X{DBfs + Tow{D^B) 



\ Px[y) 




Cx = {D^ 



<X] 



Figure 10: An illustration of the geometry surrounding ux and fix, for the case X = I . Recall that P\{y) = 
y—D^U\{y), where D'^u\{y) is the projection ofy onto the convex polytope C\ = {D^u : \\u\\ao < A}. Almost 
everywhere, small perturbations of y don't change the face on which its projection lies. The exceptional set 
A/a of points for which this property does not hold has dimension n — 1, and is a union of rays like the two 
drawn as dotted lines in the bottom right of the figure. 

Hence we have the following result: 



Theorem 1. For fixed A, the solution j3x of the signal approximation problem (12) has degrees of 
freedom 

Ai{fix) = E[nullity(i?_B(,))], 

where the nullity of a matrix is the dimension of its null space. The expectation here is taken over 
B{y), the boundary set of a dual solution ux{y)- 
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Note: above, we can choose any dual solution at y to construct the boundary set B{y), because 
by Lemmapl all dual solutions give rise to the same linear subspace null(I?_g(y)) (almost everywhere 
iny). 

Proof. Consider y ^ A/a, and let B and s be the boundary set and signs of a dual solution ux{y). 
By Lemma [3] there is a neighborhood U oi y such that 

/3a(2/') = {I-D^ux)iy') = PnuU(D_B)(2/' - HDsfs) 

for all y' G U. Taking the divergence at y we get 

(V • /3a) (y) = tr(P„,n(i3_B)) = nullity(i?_B), 

since the trace of a projection matrix is just its rank. This holds for almost every y because Afx has 
measure zero, and we can use Stein's formula to conclude that di{/3\) = E[nullity(il'_g(j,))]. D 

We now consider problem (pi), with the predictor matrix satisfying rank(X) — p, and it turns out 
that the same degrees of freedom formula holds for the fit X/3\. This is relatively straightforward 
to show, but requires sorting out the details of how to turn statements involving y, D into those 
involving y,D. First, by the same arguments as before, we know that X(5\{y) is nonexpansive as 
a function of y. But y — Pco\(x)iy) is nonexpansive in y, so XP\{y) is indeed nonexpansive, hence 
continuous and almost differentiable, as a function of y. 

Next we must establish that D^ux{y) is a locally affine projection for almost every y. Well, 
by Lemma bl this is true of D^ux{y) for y ^ A/a, so we have the desired result except on A^a == 
iPcoi{x))~^ i-^x) ■ Following the arguments in the proof of Lemma pi it is not hard to see that A/a 
now has dimension p — 1, so A^a has measure zero. 

With these properties satisfied, we have the general result: 

Theorem 2. Suppose that rank(Ar) — p. For fixed X, the fit XPx of the generalized lasso (pi) has 
degrees of freedom 

df(X/3A)-E[nullity(i?_e(,))], 
where B{y) is the boundary set of a dual solution u\{y). 

Note: as before, we can construct the boundary set B{y) from any dual solution at y, because 
by Lemma p^ the quantity null(Z?_B(y)) is invariant (almost everywhere in y). 

Proof. Let y ^ A^a- We show that (V • XPx){y) — nullity(D_g(y)), and then applying Stein's 
formula (along with the fact that A^a has measure zero) gives the result. 
Let B denote the boundary set of a dual solution u\ (y) . Then the fit is 

X$x{y) = Pnull{D^s)Pcol{X)y + c, 

where c denotes the terms that have zero derivative with respect to y. Using the fact null(X+) = 
nun(A:^), and that null(I)_B) D nun(A:+), 

"null(5_8) "col(X) = "nuU(D_B) ^ nu\\{D_e)P'nul\iX + ) 



^nun(D_8) -^nuU(X+)- 



Therefore, computing the divergence: 



(V • XPx)iy) = nullity(i?_BX+) - nullity(X+) 
= nullity(i:'_B), 

where the last equality follows because X has full column rank. This completes the proof. D 
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We saw in Section 6.2 that the null space of D has a nice interpretation for the fused lasso 



problem. In this case, the theorem also becomes easier to interpret: 

Corollary 1 (Degrees of freedom of the fused lasso). Suppose that rank{X) = p and that D 
corresponds to the fused lasso penalty on an arbitrary graph. Then for fixed X, the fit X/3\ of ([2]) 
has degrees of freedom 

di{X/3x) = E[ number of fused groups in (3x{y)]- 



Proof. If Q denotes the underlying graph, we showed in Section 6.2 that nullity (£'_b(j,)) is the 
number of connected components in G-B{y)- We also showed (see Section 7.1 for the extension to a 
general design X) that the coordinates of $\{y) are fused on the connected components of G-B{y)j 
giving the result. D 

By slightly modifying the penalty matrix, we can also derive the degrees of freedom of the sparse 
fused lasso: 

Corollary 2 (Degrees of freedom of the sparse fused lasso). Suppose that rank(X) = p and 
write Xi € MP for the ith row of X. Consider the sparse fused lasso problem: 

n p 

minimize ^ {y, - Xf pf + Ai j] | A | + A2 ^ \P^-P^, (48) 

where E is an arbitrary set of edges between nodes /3i, . . . /3p. Then for fixed Ai, A2, the fit X/3\-^^\^ 



of ( 48 ) has degrees of freedom 

di{X/3\-^^X2) = E[ number of nonzero fused groups in l3xi,\2{y) 



Proof. We can write ( 48 1 in the generalized lasso framework by taking A = A2 and 

D 



D 



fuse 

A2 



where -Dfuso is the fused lasso matrix corresponding to the underlying graph, with each row giving 
the difference between two nodes connected by an edge. 



In Section 6.2 we analyzed the null space of ZJfusc to interpret the primal-dual correspondence 
for the fused lasso. A similar interpretation can be achieved with D as defined above. Let Q denote 
the underlying graph and suppose that it has m edges (and p nodes) , so that I?fusc is ?7i x p and D is 
{m + p) X p. Also, suppose that we decompose the boundary set as B — Bi[JB2^ where Bi contains 
the dual coordinates in {1, . . . m] and B2 contains those in {m + 1, . . . m +p}. We can associate the 
first m coordinates with the m edges, and the last p coordinates with the p nodes. Then the matrix 
D-e defines a subgraph Q-b that can be constructed as follows: 

1. delete the edges of Q that correspond to coordinates in Bi, yielding Q-Bi\ 

2. keep only the nodes of Q-Bi that correspond to coordinates in B2^ yielding Q-b- 

It is straightforward to show that the nullity of Z3_e is the number of connected components in Q-b- 
Furthermore, the solution (3\-^_\^{y) is fused on each connected component of G-b and zero in all 
other coordinates. Applying Theorem [2] gives the result. D 

The above corollary proves a conjecture of |3^, in which the authors hypothesize that the degrees 
of freedom of the sparse Id fused lasso fit is equal to the number of nonzero fused coordinate groups, 
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in expectation. But Corollary [2] covers any underlying graph, which makes it a much more general 
result. 

By examining the null space of ZJ-e for other applications, and applying Theorem [2] one can 
obtain more corollaries on degrees of freedom. We omit the details for the sake of brevity, but list 
some such results in Table |2] along with those on the fused lasso for the sake of completeness. The 
table's first result, on the degrees of freedom of the lasso, was already established in ^. The results 
on trend filtering and outlier detection can actually be derived from this lasso result, because these 
problems correspond to the case rank(D) = m, and can be transformed into a regular lasso problem 
(11). For the outlier detection problem, we actually need to make a modification in order for the 



design matrix to have full column rank. Recall the problem formulation ([8|), where the coefficient 
vector is (a,/3)-^, the first block concerning the outliers, and the second the regression coefficients. 
We set ai = . . . = Qfp = 0, the interpretation being that we know a priori p points yi, . . .yp come 
from the true model, and only rest of the points yp+i, . . . y^ can possibly be outliers (this is quite 
reasonable for a method that simultaneous performs a p-dimensional linear regression and detects 
outliers). 



Problem 


Unbiased estimate of di{XP\) 


Lasso 


Number of nonzero coordinates 


Fused lasso 


Number of fused groups 


Sparse fused lasso 


Number of nonzero fused groups 


Polynomial trend filtering, order k 


Number of knots + fc + 1 


Outlier detection 


Number of outliers + p 



Table 2: Corollaries of Theorem\A giving unbiased estimates of Ai{Xj3\) for various problems discussed in 
Section^ These assume that rank(X) — p. 



10.2 Model selection 

Note that the estimates Table [2j are all easily computable from the solution vector (3x. The estimates 
for the lasso, (sparse) fused lasso, and outlier detection problems can be obtained by simply counting 
the appropriate quantities in f3\. The estimate for trend filtering may be difficult to determine 
visually, as it may be difficult to identify the knots in a piecewise polynomial by eye, but the knots 
can be counted from the nonzeros of Df3x. All of this is important because it means that we can 
readily use model selection criteria like Cp or BIC for these problems, which employ degrees of 
freedom to assess risk. For example, for the estimate X/3x of the underlying mean /i, the Cp statistic 
is 

Cp{X) ^\\y~ XPxWl - na^ + 2a^di{X^x), 

(see [5^ for the classic linear regression case), and is an unbiased estimate of the true risk E[||/i — 
X/^aIII] ■ Hence we can define 

dp{X) = \\y-XPx\\l- na^ + 2a2millity(D_B), 

replacing di{X(3x) by its own unbiased estimate nullity(£'_g). This modified statistic Cp{\) is still 
unbiased as an estimate of the true risk, and this suggests choosing A to minimize Cp{X). For this 
task, it turns out that Cp{X) obtains its minimum at one of the critical points {Ai, . . . At} in the 
solution path of Px- This is true because nullity (/?_b) is a step function over these critical points, 
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and the residual sum of squares \\y ~ X(ix\W is monotone nondecreasing for A in between critical 
points (this can be checked using ([40|). Therefore Algorithm [2] can be used to simultaneously 
compute the solution path and select a model, by simply computing Cp(Afe) at each iteration k. 

10.3 Shrinkage and the £i norm 

At first glance, the results in Table [2] seem both intuitive and unbelievable. For the fused lasso, 
for example, we are told that on average we spend a single degree of freedom on each group of 
coordinates in the solution. But these groups are being adaptively selected based on the data, so 
aren't we using more degrees of freedom in the end? As another example, consider the trend filtering 
result: for a cubic fit, the degrees of freedom is the number of knots + 4, in expectation. A cubic 
regression spline also has degrees of freedom equal to the number of knots + 4; however, in this 
case we fix the knot locations ahead of time, and for cubic trend filtering the knots are selected 
automatically. How can this be? 

This seemingly remarkable property — that searching for the nonzero coordinates, fused groups, 
knots, or outliers doesn't cost us anything in terms of degrees of freedom — is explained by the 
shrinking nature of the li penalty. Looking back at the criterion in (pi), it is not hard to see that 
the nonzero entries in Dfix are shrunken towards zero (imagine the problem in constrained form, 
instead of Lagrangian form). For the fused lasso, this means that once the groups are "chosen", 
their coefficients are shrunken towards each other, which is less greedy than simply fitting the group 
coefficients to minimize the squared error term. Roughly speaking, this makes up for the fact that 
we chose the fused groups adaptively, and in expectation, the degrees of freedom turns out "just 
right" : it is simply the number of groups. 

This leads us to think about the ^o-equivalent of problem ([2]), which is achieved by replacing 
the li norm by an £o norm (giving best subset selection when D = I). Solving this problem 
requires a combinatorial optimization, and this makes it difficult to study the properties of its 
solution in general. However, we do know that the solution of the Iq problem does not enjoy any 
shrinkage property like that of the lasso solution: if we fix which entries of D/? are nonzero, then the 
penalty term is constant and the problem reduces to an equality-constrained regression. Therefore, 
in light of our above discussion, it seems reasonable to conjecture that the £o fit has greater than 
E[nullity(D_B)] degrees of freedom. When D = I, this would mean that the degrees of freedom of 
the best subset selection fit is more than the number of nonzero coefficients, in expectation. 

11 Discussion 

We have studied a generalization of the lasso problem, in which the penalty is ||I?/3|| i for a matrix D. 
Several important problems (such as the fused lasso and trend filtering) can be expressed as a special 
case of this, corresponding to a particular choice of D. We developed an algorithm to compute a 
solution path for this general problem, provided that the design matrix X has full column rank. 
This is achieved by instead solving the (easier) Lagrange dual problem, which, using simple duality 
theory, yields a solution to the original problem after a linear transformation. 

Both the dual solution path and the primal (original) solution path are continuous and piecewise 
linear with respect to A. The primal solution /3a can be written explicitly in terms of the boundary 
set B, which contains the coordinates of the dual solution that are equal to ±A, and the signs of 
these coordinates s. Further, viewing the dual solution as a projection onto a convex set, we derived 
a simple formula for the degrees of freedom of the generalized lasso fit. For the fused lasso problem, 
this result reveals that the number of nonzero fused groups in the solution is an unbiased estimate 
of the degrees of freedom of the fit, and this holds true for any underlying graph structure. Other 
corollaries follow, as well. 
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An implementation of our path algorithm, following the ideas presented in SectionjSj is a direction 
for future work, and will be made available as an R package "genlasso" on the CRAN website j21] ■ 
There are several other directions for future research. We describe three possibilities below. 

• Specialized implementation for the fused lasso path algorithm. When D is the fused lasso matrix 
corresponding to a graph G, projecting onto the null space of D_b is achieved by a simple 
coordinate- wise average on each connected component of Q-b- It may therefore be possible 
(when X = /) to compute the solution path /?a without having to use any linear algebra, but 
by instead tracking the connectivity of Q. This could improve the computational efficiency 
of each iteration, and could also lead to a parallelized approach (in which we work on each 
connected component in parallel). 

• Number of steps until termination. The number of steps T taken by our path algorithm, for 
a general D, is determined by how many times dual coordinates leave the boundary. This is 
related to an interesting problem in geometry studied by [TU] , and investigating this connection 
could lead to a more definitive statement about the algorithm's computational complexity. 

• Connection to forward stagewise regression. When D = I (and rank(X) = p), wc proved that 
our path algorithm yields the LARS path (when LARS is run in its original, unmodified state) 
if we just ignore dual coordinates leaving the boundary. LARS can be modified to give forward 
stagewise regression, which is the limit of forward stepwise regression when the step size goes 
to zero (see [H]). A natural follow-up question is: can our algorithm be changed to give this 
path too? 

In general, we believe that Lagrange duality deserves more attention in the study of convex 
optimization problems in statistics. The dual problem can have a complementary (and interpretable) 
structure, and can offer novel mathematical or statistical insights into the original problem. 
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